Tēzaurs.lv: the Largest Open Lexical Database for Latvian
Abstract
We describe an extensive and versatile lexical resource for Latvian, an under-resourced Indo-European language, which we call Tezaurs (Latvian for ‘thesaurus’). It comprises a large explanatory dictionary of more than 250,000 entries derived from more than 280 external sources. The dictionary is enriched with phonetic, morphological, semantic and other annotations, and is augmented by various language processing tools that allow for the generation of inflectional forms and pronunciation, for on-the-fly selection of corpus examples, for suggesting synonyms, etc. Tezaurs is available as a public and widely used web application for end-users, as an open data set for use in language technology (LT), and as an API, a set of web services for integration into third-party applications. The ultimate goal of Tezaurs is to be the central computational lexicon for Latvian, bringing together all Latvian words and frequently used multi-word units and allowing for the integration of other LT resources and tools.
Similar resources
Finite State Morphology Tool for Latvian
The existing Latvian morphological analyzer was developed more than ten years ago. Its main weaknesses are low processing speed on large text corpora, the complexity of adding new entries to the lexical database, and limited support for different operating platforms. This paper describes the creation of a new Latvian morphology tool. The tool has the capability to return lemm...
Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach
We discuss an experiment on automatic identification of bi-gram multiword expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used because of the scarcity and limited quality of lexical resources and tools (e.g., POS-tagger, parser). While combining LAMs with ML is rather effective for other languages, it has shown som...
Simple PPDB: A Paraphrase Database for Simplification
We release the Simple Paraphrase Database, a subset of the Paraphrase Database (PPDB) adapted for the task of text simplification. We train a supervised model to associate simplification scores with each phrase pair, producing rankings competitive with state-of-the-art lexical simplification models. Our new simplification database contains 4.5 million paraphrase rules, making it the largest a...
Modernized Latvian Ergonomic Keyboard
More and more people use computers and create content with keyboards (even with leading-edge touch-screen technology available). As in most of the world, the conventional "Qwerty" keyboard is used in Latvia. For Latvian, however, it is much worse than for English, especially because of the enormous load on the little fingers. It causes repetitive strain injuries and affects the productivity of workers with e...